# `lsup_rdf`

**This project is work in progress.**

Embedded RDF (and maybe later, generic graph) store and manipulation library.

## Purpose

The goal of this library is to provide efficient and compact handling of RDF
data. At least a complete C API and Python bindings are planned.

This library can be thought of as SQLite or BerkeleyDB for graphs. It can be
embedded directly in a program and store persistent data without running a
server. In addition, `lsup_rdf` can perform in-memory graph
operations such as validation, de/serialization, boolean operations, lookup,
etc.

Two graph back ends are available: a memory one based on hash maps and a
disk-based one based on [LMDB](https://symas.com/lmdb/), an extremely fast and
compact embedded key-value store. Graphs can be created independently with
either back end within the same program. Triples in the persistent back end are
fully indexed and optimized for a balance of lookup speed, data compactness,
and write performance (in order of importance).

This library was initially meant to replace the RDFLib dependency and Cython code
in [Lakesuperior](https://notabug.org/scossu/lakesuperior) in an effort to
reduce code clutter and speed up RDF handling; it is now a project for an
independent RDF library, but unless the contributor base expands, it will
remain focused on serving Lakesuperior.


## Development Status

**Alpha.** The API structure is not yet stable and may change radically. The
code may not compile, or may throw a fit when run. Testing is minimal. At the
moment this project is only intended for curious developers and researchers.

This is also my first stab at writing a C library (coming from Python) and an
unpaid fun project, so don't be surprised if you find some gross stuff.


## Road Map

### In Scope – Short Term

The short-term goal is to support usage in Lakesuperior and a workable set
of features as a standalone library:

- Handling of graphs, triples, terms
- Memory- and disk-backed (persistent) graph storage
- Contexts (disk-backed only)
- Handling of blank nodes
- Namespace prefixes
- Validation of literal and URI terms
- Validation of RDF triples
- Fast graph lookup using matching patterns
- Graph boolean operations
- Serialization and de-serialization to/from N-Triples and N-Quads
- Serialization and de-serialization to/from Turtle and TriG
- Compile-time configuration of max graph size (efficiency vs. capacity)
- Python bindings
- Basic command line utilities

### Possibly In Scope – Long Term

- Binary serialization and hashing of graphs
- Binary protocol for synchronizing remote replicas
- Backend for massive distributed storage (possibly Ceph)
- Lua bindings

### Likely Out of Scope

(Unless provided and maintained by external contributors)

- C++ bindings
- JSON-LD de/serialization
- SPARQL queries (We'll see... Will definitely need help)

## Building

### Requirements

- It is recommended to build and run `lsup_rdf` on a Linux system. No other
  OS has been tested so far.
- A C compiler. Only `gcc` has been tested so far.
- [re2c](https://re2c.org/) to build the RDF language lexers.
- [cinclude2dot](https://www.flourish.org/cinclude2dot) and
  [Graphviz](https://graphviz.org/) to generate the dependency graph (optional).


### `make` commands

The default `make` command compiles the library. Enter `make help` to get an
overview of the other available commands.

`make install` installs libraries and headers under the prefix set by the
environment variable `$PREFIX`. If this is unset, the default `/usr/local`
prefix is used.

Options to compile with debug symbols are available.


### Compile-Time defines (`-D[...]`)

`DEBUG`: Set debug mode: memory map is at reduced size, logging is forced to
TRACE level, etc.

`LSUP_RDF_STREAM_CHUNK_SIZE`: Size of the RDF decoding buffer, i.e., the maximum
size of a chunk of RDF data fed to the parser when decoding an RDF file into a
graph. This should be larger than the maximum expected size of a single term in
your RDF source. The default value is 8192, which is mildly conservative. If you
experience parsing errors on decoding, and they happen on a term such as a
very long string literal, try recompiling the library with a larger value.
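As a rough illustration of how a define of this kind is usually overridden, the
sketch below uses the common `#ifndef` fallback pattern, so that passing
`-DLSUP_RDF_STREAM_CHUNK_SIZE=65536` to the compiler replaces the default. The
fallback block and the `term_fits_chunk` helper are assumptions made for this
example and are not taken from the library source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fall back to the documented default unless the compiler was given
 * -DLSUP_RDF_STREAM_CHUNK_SIZE=<n>. (Assumed pattern, not library code.) */
#ifndef LSUP_RDF_STREAM_CHUNK_SIZE
#define LSUP_RDF_STREAM_CHUNK_SIZE 8192
#endif

/* Hypothetical helper: true if a term of `term_len` bytes is strictly
 * smaller than the decoding buffer, as the text above recommends. */
bool term_fits_chunk(size_t term_len)
{
    return term_len < (size_t) LSUP_RDF_STREAM_CHUNK_SIZE;
}
```

A check like this can be run over a sample of your data before deciding whether
a larger chunk size is worth a recompile.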

## Embedding

The generated `liblsuprdf.so` and `liblsuprdf.a` libraries can be linked
dynamically or statically to your code. Only the `lsup_rdf.h` header, which
recursively includes other headers in the `include` directory, needs to be
`#include`d in the embedding code.

Environment variables and/or compiler options might have to be set in order to
find the dynamic libraries and headers in their install locations.

For compilation and linking examples, refer to `test`, `memtest`, `perftest`
and other actions in the current Makefile.


### Environment Variables

`LSUP_MDB_STORE_PATH`: The file path for the persistent store back end. For
production use it is strongly recommended to set this to a permanent location
on the fastest storage volume available. If unset, the current directory will
be used. The directory must exist.
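Since the store directory must already exist, an embedding program may want to
verify it before touching the persistent back end. The following is a minimal
sketch using POSIX `stat()`; the function names are hypothetical and are not
part of the `lsup_rdf` API.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Return the configured path, or "." (current directory) when the value
 * is unset or empty, mirroring the documented fallback. */
const char *store_path_or_default(const char *value)
{
    return (value != NULL && *value != '\0') ? value : ".";
}

/* True if `path` exists and is a directory. */
bool is_existing_dir(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* True if the location selected by LSUP_MDB_STORE_PATH is usable. */
bool store_path_ready(void)
{
    return is_existing_dir(store_path_or_default(getenv("LSUP_MDB_STORE_PATH")));
}
```

Calling `store_path_ready()` at startup lets the program report a clear error
instead of failing later when the store is first opened.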

`LSUP_LOGLEVEL`: A number between 0 and 5, corresponding to:

- 0: `TRACE`
- 1: `DEBUG`
- 2: `INFO`
- 3: `WARN`
- 4: `ERROR`
- 5: `FATAL`

If unspecified, it defaults to 3 (`WARN`).
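The mapping above can be sketched in C as follows. This is an illustration
only: the function names are hypothetical, and falling back to the default on
an invalid or out-of-range value is an assumption, since the behavior of the
library in that case is not documented here.

```c
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_LOGLEVEL 3  /* WARN, the documented default */

/* Parse a log level string; return the default if the value is missing,
 * not a number, or outside 0..5 (assumed behavior). */
int loglevel_from_string(const char *value)
{
    if (value == NULL || *value == '\0') return DEFAULT_LOGLEVEL;

    char *end;
    errno = 0;
    long level = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0' || level < 0 || level > 5)
        return DEFAULT_LOGLEVEL;

    return (int) level;
}

/* Map a numeric level to its documented name. */
const char *loglevel_name(int level)
{
    static const char *names[] = {
        "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"
    };
    return (level >= 0 && level <= 5) ? names[level] : "UNKNOWN";
}

/* Read the level from the LSUP_LOGLEVEL environment variable. */
int loglevel_from_env(void)
{
    return loglevel_from_string(getenv("LSUP_LOGLEVEL"));
}
```

For example, running a program with `LSUP_LOGLEVEL=1` in its environment would
make `loglevel_from_env()` return 1, which `loglevel_name()` maps to `DEBUG`.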

`LSUP_MDB_MAPSIZE`: Virtual memory map size. It is recommended to leave this
alone. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit
systems. The map size by itself does not use up any extra resources.


### C API Documentation

Almost all header files are documented. Run `doxygen` (see
[Doxygen](https://www.doxygen.nl/index.html)) to generate HTML documentation in
`docs/html`.


### Python API Documentation

*TODO*
